Excel BI - Excel Challenge 865

excel-challenges

excel-formulas

🔰 Many legacy machines are dumping logs where the data position varies wildly.

Published

March 24, 2026

Illustration for Excel BI - Excel Challenge 865

Challenge Description

🔰 Many legacy machines are dumping logs where the data position varies wildly. The combined log is given. Extract following -

Solutions

library(tidyverse)
library(readxl)

path <- "Excel/800-899/865/865 Complex Regex.xlsx"
input <- read_excel(path, range = "A2:A95")
test <- read_excel(path, range = "C2:E95")

result <- input %>%
  mutate(
    Date = ymd(str_extract(Data, "[0-9]{2,4}[/|-][0-9]{1,2}[/|-][0-9]{1,2}")),
    `Product ID` = str_extract(Data, "[A-Z]{2}[0-9]{3}"),
    `Weight in Kg` = {
      w <- str_extract(
        Data,
        "(kg|kgs|gm|gms)\\s*[0-9]+\\.?[0-9]*|[0-9]+\\.?[0-9]*\\s*(kg|kgs|gm|gms)"
      )
      val <- as.numeric(str_extract(w, "[0-9]+\\.?[0-9]*"))
      unit <- str_extract(w, "kg|kgs|gm|gms")
      case_when(
        unit %in% c("gm", "gms") ~ val / 1000,
        unit %in% c("kg", "kgs") ~ val,
        TRUE ~ NA_real_
      )
    }
  ) %>%
  select(-Data)

all.equal(result, test, check.attributes = FALSE)
# [1] TRUE

Logic: Read the workbook ranges needed for the challenge; Derive the required intermediate columns; Parse the packed text or string structure; Apply the business rule conditions explicitly.
Strengths: The solution stays close to the text pattern itself, which makes the extraction logic easy to audit.
Areas for Improvement: The solution assumes the workbook layout and selected ranges remain stable, so any structural change in the sheet would require small adjustments.
Gem: A small number of well-targeted text patterns does most of the heavy lifting.

import pandas as pd
import re
import numpy as np

path = "Excel/800-899/865/865 Complex Regex.xlsx"
input = pd.read_excel(path, usecols="A", skiprows=1, nrows=93)
test = pd.read_excel(path, usecols="C:E", skiprows=1, nrows=93)

def extract_date(s):
    m = re.search(
        r"(24|2024)[-/](\d{1,2})[-/](\d{1,2})",
        s or ""
    )
    return pd.to_datetime(m.group(0), errors="coerce", dayfirst=False, yearfirst=True) if m else pd.NaT

def extract_product_id(s):
    m = re.search(r"[A-Z]{2}[0-9]{3}", s or "")
    return m.group(0) if m else None

def extract_weight_kg(s):
    m = re.search(r"(kg|kgs|gm|gms)\s*[0-9]+\.?[0-9]*|[0-9]+\.?[0-9]*\s*(kg|kgs|gm|gms)", s or "")
    if not m: return np.nan
    v = re.search(r"[0-9]+\.?[0-9]*", m.group(0))
    u = re.search(r"kg|kgs|gm|gms", m.group(0))
    if not v or not u: return np.nan
    val, unit = float(v.group(0)), u.group(0)
    return val/1000 if unit in ("gm","gms") else val if unit in ("kg","kgs") else np.nan

result = pd.DataFrame({
    "Date": input.iloc[:,0].apply(extract_date),
    "Product ID": input.iloc[:,0].apply(extract_product_id),
    "Weight in Kg": input.iloc[:,0].apply(extract_weight_kg)
})

result["Weight in Kg"] = result["Weight in Kg"].astype("int64")

print(result.equals(test)) #

The Python version expresses the core extraction rule directly and keeps the pattern matching easy to review.

Difficulty Level

Medium

The individual steps are manageable, but the correct transformation pattern is not obvious from the raw data.